BATON: A Balanced Tree Structure for Peer-to-Peer Networks
We propose a balanced tree structure overlay on a peer-to-peer network capable of supporting both exact queries and range queries efficiently. In spite of the tree structure causing distinctions to be made between nodes at different levels in the tree, we show that the load at each node is approximately equal. In spite of the tree structure providing precisely one path between any pair of nodes, we show that sideways routing tables maintained at each node provide sufficient fault tolerance to permit efficient repair. Specifically, in a network with N nodes, we guarantee that both exact queries and range queries can be answered in O(log N) steps and also that update operations (to both data and network) have an amortized cost of O(log N). An experimental assessment validates the practicality of our proposal. Singapore-MIT Alliance (SMA)
Efficient Retrieval of Similar Time Sequences Under Time Warping
Fast similarity searching in large time-sequence databases has attracted a lot of research interest. All of these efforts use the Euclidean distance ($L_2$), or some variation of the $L_p$ metric. $L_p$ metrics lead to efficient indexing, thanks to feature extraction (e.g., by keeping the first few DFT coefficients) and subsequent use of fast spatial access methods for the points in feature space. In this work we examine a popular, field-tested dissimilarity function, the "time warping" distance function, which permits local accelerations and decelerations in the rate of the signals or sequences. This function is natural and suitable for several applications, like matching of voice, audio and medical signals (e.g., electrocardiograms). However, from the indexing viewpoint it presents two major challenges: (a) it does not lead to any natural "features", precluding the use of spatial access methods; (b) it is quadratic in the length of the sequences involved. Here we show how to overcome both problems: for the former, we propose using a modification of the so-called "FastMap" to map sequences into points, trading off a tiny amount of "recall" (typically zero) for large gains in speed. For the latter, we provide a fast, linear test to help us quickly discard many of the false alarms that FastMap will typically introduce. Using both ideas in cascade, our proposed method achieved a speed-up of up to 7.8 times over straightforward sequential scanning, on both real and synthetic datasets.
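To make the quadratic cost concrete, here is a minimal sketch of the textbook dynamic-programming formulation of the time-warping distance. It assumes numeric sequences and uses the absolute difference as the local cost; the function name `dtw_distance` is illustrative. It is not the FastMap-based method or the linear filtering test the abstract describes, only the baseline distance those techniques aim to speed up.

```python
def dtw_distance(a, b):
    """Textbook time-warping distance between two numeric sequences.

    Assumption: the local cost is the absolute difference |a[i] - b[j]|.
    Runs in O(len(a) * len(b)) time, which is the quadratic cost at issue.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Either sequence may "stall" (repeat an element) or both advance.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]
```

A sequence and a locally slowed-down copy of it, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, are at distance zero under this function, whereas a point-by-point Euclidean comparison cannot even be applied to sequences of different lengths.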
Analysis of the Clustering Properties of Hilbert Space-filling Curve
Several schemes for linear mapping of multidimensional space
have been proposed for many applications such as
access methods for spatio-temporal databases, image compression and so on.
In all these applications, one of the most desired properties
from such linear mappings is clustering, which means
the locality between objects in the multidimensional space
is preserved in the linear space.
It is widely believed that the Hilbert space-filling curve
achieves the best clustering.
In this paper we provide closed-form formulas for
the number of clusters required by a given query region
of an arbitrary shape (e.g., polygons and polyhedra)
for the Hilbert space-filling curve.
Both the asymptotic solution for a general case and the exact solution
for a special case generalize the previous work,
and they agree with the empirical results that
the number of clusters depends on the hyper-surface area of the query region
and not on its hyper-volume.
We have also shown that the Hilbert curve achieves better clustering
than the z-curve.
From the practical point of view,
the formulas given in this paper provide a simple measure
which can be used to predict the required disk access behaviors
and hence the total access time.
(Also cross-referenced as UMIACS-TR-96-20)
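The notion of a "cluster" used above can be illustrated empirically. The sketch below assumes a 2-D grid whose side `n` is a power of two, maps each cell to its position on the Hilbert curve with the widely used iterative bit-manipulation conversion, and counts the runs of consecutive curve positions covered by a query region. The names `xy2d` and `count_clusters` are illustrative; the code counts clusters by enumeration and does not implement the paper's closed-form formulas.

```python
def xy2d(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def count_clusters(n, cells):
    """Count maximal runs of consecutive Hilbert indices covered by the query cells."""
    idx = sorted({xy2d(n, x, y) for x, y in cells})
    if not idx:
        return 0
    # A new cluster starts wherever the next index is not the immediate successor.
    return 1 + sum(1 for p, q in zip(idx, idx[1:]) if q != p + 1)
```

Each cluster corresponds to one contiguous run on the linear order, so under the usual assumption that a run can be fetched with a single sequential read, the cluster count is a rough proxy for the number of disk seeks a query needs.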
XML Reconstruction View Selection in XML Databases: Complexity Analysis and Approximation Scheme
Query evaluation in an XML database requires reconstructing XML subtrees
rooted at nodes found by an XML query. Since XML subtree reconstruction can be
expensive, one approach to improve query response time is to use reconstruction
views - materialized XML subtrees of an XML document, whose nodes are
frequently accessed by XML queries. For this approach to be efficient, the
principal requirement is a framework for view selection. In this work, we are
the first to formalize and study the problem of XML reconstruction view
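selection. The input is a tree $T$, in which every node $i$ has a size $s_i$
and profit $p_i$, and the size limitation $C$. The target is to find a subset
of subtrees rooted at nodes $i_1, \ldots, i_k$ respectively such that
$s_{i_1} + \cdots + s_{i_k} \le C$, and $p_{i_1} + \cdots + p_{i_k}$ is maximal.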
Furthermore, there is no overlap between any two subtrees selected in the
solution. We prove that this problem is NP-hard and present a fully
polynomial-time approximation scheme (FPTAS) as a solution.
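The selection problem stated above can be sketched as a knapsack-style dynamic program over the tree. The code below is an exact, pseudo-polynomial sketch under stated assumptions: sizes are non-negative integers, the size attached to a node is the size of the whole subtree rooted there, and the tree is given as a hypothetical `children` dictionary. It is not the FPTAS the abstract refers to; it only illustrates the objective and the no-overlap constraint.

```python
def select_views(children, size, profit, root, capacity):
    """Best total profit of non-overlapping subtrees with total size <= capacity.

    Assumptions (not from the source): `children` maps a node to a list of its
    children, `size[v]` is the integer size of the subtree rooted at v, and
    `profit[v]` is the profit of materializing that subtree.
    """
    neg_inf = float("-inf")

    def solve(v):
        # table: total size used -> best profit, choosing only inside v's subtree
        table = {0: 0}
        for c in children.get(v, []):
            child = solve(c)
            merged = {}
            for s1, p1 in table.items():
                for s2, p2 in child.items():
                    s = s1 + s2
                    if s <= capacity and p1 + p2 > merged.get(s, neg_inf):
                        merged[s] = p1 + p2
            table = merged
        # Alternative: take the subtree rooted at v itself. This excludes every
        # descendant, so it replaces (rather than extends) the combined choices.
        if size[v] <= capacity and profit[v] > table.get(size[v], neg_inf):
            table[size[v]] = profit[v]
        return table

    return max(solve(root).values())
```

For a root `r` of size 10 and profit 7 with two leaf children of sizes 4 and 5 and profits 5 and 4, a limit of 10 favors the two leaves (profit 9) over the root, because picking the root would overlap both of them.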
One-dimensional and multi-dimensional substring selectivity estimation
With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/42330/1/778-9-3-214_00090214.pd
Capturing Data Provenance from Statistical Software
We have created tools that automate one of the most burdensome aspects of documenting the provenance of research data: describing data transformations performed by statistical software. Researchers in many fields use statistical software (SPSS, Stata, SAS, R, Python) for data transformation and data management as well as analysis. The C2Metadata ("Continuous Capture of Metadata for Statistical Data") Project creates a metadata workflow paralleling the data management process by deriving provenance information from scripts used to manage and transform data. C2Metadata differs from most previous data provenance initiatives by documenting transformations at the variable level rather than describing a sequence of opaque programs. Command scripts for statistical software are translated into an independent Structured Data Transformation Language (SDTL), which serves as an intermediate language for describing data transformations. SDTL can be used to add variable-level provenance to data catalogues and codebooks and to create "variable lineages" for auditing software operations. Better data documentation makes research more transparent and expands the discovery and re-use of research data.
Four Lessons in Versatility or How Query Languages Adapt to the Web
Exposing not only human-centered information, but also machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled new kinds of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposure has fractured the Web into islands of data, each in different Web formats: some providers choose XML, others RDF, and still others JSON or OWL for their data, even in similar domains. This fracturing stifles innovation, as application builders have to cope not with one Web stack (e.g., XML technology) but with several, each of considerable complexity. With Xcerpt we have developed a rule- and pattern-based query language that aims to shield application builders from much of this complexity: in a single query language, XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C’s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply to querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear time and space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards more convenient, yet highly efficient, data access in a “Web of Data”.